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ABSTRACT 

This thesis presents an introduction to Hidden Markov models (HMM) and their 
applications to classification problems. HMMs have been used extensively to model the 
temporal structure and variability of speech and other signals in the last decade. We 
selected to write our own HMM implementation in MATLAB. We tested our software 
on a limited isolated 4-word recognition. We also applied our implementation to the 
recognition of mine-like objects buried in shallow sand, using seismo-acoustic data 
obtained from an on-going project at the Naval Postgraduate School. Initial results 
indicate that the HMM-based classifier can recognize the type of mine-like object, 
independent of the object weight with a 97% accuracy. Results also indicate that it can 
recognize the object type at different distances with a 100% accuracy. However, the 
experiments were conducted with very few data, and further work needs to be done to 
confirm these initial findings by using a larger data set. Finally, we benchmarked our 
results against those obtained using a back-propagation neural network implementation, 
which were found to be similar, but slower than the HMM-based implementation. 
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I. 



-INTRODUCTION 



This thesis presents an introduction to Hidden Markov models (HMM) and their 
applications to classification problems. HMMs have been used extensively to model the 
temporal structure and variability of speech and other signals in the last decade. For 
example, they have become a major player in the speech area [4, 5, 9]. We selected to 
write our own HMM implementation even though sophisticated HMM software is readily 
available on the market to better understand the basic concepts behind the theory. Our 
implementation uses MATLAB because it is a high-level language easy to work with. 
As a result, the software may not be as fast as that obtained with a compiled 
implementation but it is easy to understand, which was one of the main goals of the 
research. We tested our software on a limited isolated 4-word recognition, and we also 
applied our implementation to the recognition of mine-like objects buried in shallow 
sand, using seismo-acoustic data obtained from an on-going project headed by Prof. Muir 
from the Physics Department at the Naval Postgraduate School. Finally, we 
benchmarked our results against those obtained using a back-propagation neural network 
implementation. 

Chapter II introduces the concepts of HMMs in an “engineering-oriented” simple 
fashion. We illustrate the theoretical concepts with basic examples to facilitate the 
understanding of this difficult topic. We cover the three specific problems which HMM 
address, their solutions, emphasizing the computational savings obtained with the 
forward, backward, Viterbi and Baum-Welch algorithms. 
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Chapter HI presents the application of HMMs to a simple speech classification 
problem, using four isolated three-syllable words: “Microsoft,” “Statistics,” “Instructor” 
and “Professor.” Our goal is to show how a generic classifier can be set-up through the 
simple speech recognition example. We introduce the concept of vector quantization 
(VQ) applied to generate discrete symbols from the speech feature vectors created with 
Linear Prediction Coefficients (LPC) and energy coefficients. Two different 
implementations of VQ are considered 1) a Neural Network (NN)-based VQ scheme; and 
2) a K-means algorithm [9]. In addition, we point out the potential numerical difficulties 
which can be encountered while setting up and implementing the HMM software. 

Chapter IV considers the application of the HMM-based classifier to the 
classification of two mine-like objects buried in sand; a cylinder and a powder keg with 
weights ranging from 71kg to 290kg. Results show that recognition performance is good 
under various conditions of weight and distance of the object. 

Chapter V presents a back-propagation neural network classifier designed to 
recognize the two mine-like objects discussed in Chapter IV. This implementation was 
conducted on the same data to compare the performance of the two schemes. Results 
show the performances to be similar (around 95%), but the HMM-based implementation 
procedure is faster than the NN-based implementation. 
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II. 



INTRODUCTION TO HIDDEN MARKOV MODELS 



This chapter introduces the basic theory of Hidden Markov models (HMM), 
which are a very powerful technique for modeling the temporal structure and variability 
in speech and other applications. The understanding of HMM is very important in order 
to use them properly and achieve the best results for signal classification. 

A. HMM BACKGROUND 

Hidden Markov Model theory was first introduced in the late 1960s and early 
1970s by Baker at Camegie-Mellon University and Jeninek and colleagues at IBM for 
speech recognition [6]. Since then they have been used extensively for speech 
applications, and also successfully to other tasks such as human face identification, lip 
and speech-reading, optical character recognition and time DNA modeling ([6] and 
references therein). The reason for this wide range of applications is the rich 
mathematical structure the HMMs are built on, yielding optimal results if used properly. 
A detailed overview of HMM theory was presented by Rabiner in the late 1980s [3, 4]. 

There are three main reasons why we may need to model a signal. First, we apply 
signal modeling to mathematically describe the signal, so that we’ll be able to process it, 
for example to denoise a speech signal. Models are also important because they let us 
describe the signal source, which doesn’t have to be directly available to the user. For 
example we cannot produce real seismic waves without an earthquake, but we can create 
models of such signals and then process them. Finally, the most important reason for the 
widely spread use of signal models is that they perform well in practice [3]. 

There are two types of signal models; deterministic and statistical. Deterministic 
modeling is applied when dealing with signals with known physical characteristics. For 
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example a sinewave is completely specified by its frequency, phase and amplitude. 
Stochastic modeling is applied when one tries to characterize only the statistical 
properties of the signal. For example, statistical modeling may include Gaussian Poisson 
and Markov process to describe events. Usually, real applications use both deterministic 
and stochastic modeling. In this work we focus on statistical signal modeling, using the 
Hidden Markov Model (HMM). 

B. INTRODUCTION 

The HMM theory is based on the Markov Chain. We can define the Markov 
Chain as a probabilistic description of transitions between a system’s states. A state can 
be a property, or generally a condition, that a system/model might have at a particular 
instance. The HMM consists of an underlyining Markov chain describing the 
probabilistic status between the states, as that shown in Figure 2.1, which illustrates a 
three state left-right model. For example, suppose we want to model a speech signal. 
First, the signal is split into T time frames. Then, a set of parameters (such as LPC 
coefficients, energy, etc...) is extracted together with a set of symbols for each time 
frame. The sets of symbols represent each frame characteristics, and are called 
observations. As a result, the entire model is a sequence of symbols, and each symbol is 
a system model depicting each segment. In most cases we choose the segment length 
empirically, but sometimes adjust it so that it is large enough to contain all the 
information (usually spectral) that makes it unique, by comparison with the other 
segments, due to possible change in the signal behavior. Generally, we can assume that 
we reach an optimal number of frames when, by decreasing the frame length, we generate 
models identical to those already generated. At this point, HMMs take advantage of the 
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properties existing between adjacent segments properties by addressing the following 
three problems [3]: 1) how to identify the characteristic frames; 2) how to characterize 
the relation between all successive segments; and 3) what types of properties should be 
extracted to model each segment. 

C. DISCRETE MARKOV PROCESS 

An accurate definition for the HMM according to Rabiner [3] is that “A HMM is 
a doubly stochastic process that is not observable (it is hidden), but can only be observed 
through another set of stochastic processes that produce the sequence of observed 
symbols.” 

Consider a system which exists at time t in one of N possible potential states, as 
illustrated in Figures 2.1 and 2.2, where N=3. Each of the three circles represents a state 
of the model. At a specific discrete time instant t within frame k, the model is always at 
one of those three states, and we observe the sequence Ok. Generally, given that the 
system has N states S = {si, S 2 , . . . ,sn}, where sj is the j*^ state, at every time t=k, the 
model passes through the sequences of states Q = {qi, q 2 , . . . ,qt}, where qk is the state at 
time k. The model that describes all this information is called a Markov chain. As we 
can see from the example in Figure 2.1, 

( 2 - 1 ) 

which means that if the model at time t=k is at state, it can remain at the same state at 
time t=k+l. Note that no backward transitions are allowed in Fig 2.1. As a result, this 
model is called a “left-to-right” model (the model is called ergotic when backward 
transitions are allowed, as illustrated in Fig 2.2). 



5 



We define the probability ay as the transition probability from state i to state j. 
For example, ^23 is the probability of going from the 2“** to the state, and is defined as: 

a,j = P[q, = j\q, - 1 = /] , 1 < i,j < N , (2.2) 



with the following constrains: 



aij>0 V/,;, 

f,au = l Vi, 
y=i 



(2.3) 



since all probabilities are positive numbers and the summation of all transition 
probabilities is 1 . 
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Figure 2.2 A Markov Process (Chain), ergotic model 
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For the discrete-time case, a system governed by known or predictable dynamics 
can be modeled using a Markov chain. The probability of observing the observation Ot, 
given the model X at a time t, is determined by the state at the time preceding states [6]: 

P{Ot\X) = P\qt = sj,qt-\ = Si, qt-i = sk, . . . ] . (2.4) 

For a first-order Markov chain, eq (2.4) can be simplified to: 

P(a|A) = nPt, l<r<r. (2.5) 

k=\ 

Finally, we define the initial conditions, the probabilities of starting (t=l) at the i*^ state sj 
as: 

7ti = P[q^=s,\ i = l,2, -N. (2.6) 

1. Example 1 

Let’s evaluate the probability of the observation O = {si, Si, S3} for the example 
shown in Figure 2.2. So, applying eqs (2.4) and (2.5) to the example shown in Figure 2.1 
leads to: 

Pr(o|A)=P[/'j. 5., 5,;|/l] 

Pt(0|A) = f[53|52}p[52|5i]p[5i] 

Pr(o|/l)=;r,a,3a,2. 

2. Example 2: Browser Tracking. 

Next, let us consider the following example where a user wishes to track the 
browser type used by visitors to the Web. Assume: 1) only three types of browsers are 
available: Microsoft Explorer (MIE), Netscape (NET) and America Online’ s browser 
(AOL); 2) the specific browser type version is not taken into account. 
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Table 2.1 Browser Tracking 



User# 


1 


2 


3 


4 


Browser 


MS Explorers. 0 


AOL4.0 


Netscape3.0 


Netscapeb.l 


Browser 

Observation 

Symbol 


1 


2 


3 


^ ! 



An observation symbol is assigned to each visitor’s browser types, as illustrated in Table 
2.1. Note that we assign the same symbol (3) to both versions of Netscape, as the 
specific browser version in not tracked by the user. In addition we assume transition 
probabilities which represent probabilities of going from one state to another, i.e., 
browser, to another are available from previous statistical studies: 



a = 



0.3 0.4 0.3 
0.2 0.5 0.3 
0.8 0.1 0.1 



Thus, the resulting observation sequence is 0={ 1, 2, 3, 3}. Finally, let’s assume that the 
probability the first user uses the explorer browser is tii=0.5. 

Therefore: 

P{0 1 A) = F[1, 2, 3, 3 1 a] = P[1]P[2 \ l]p[3 \ l]p[3 \ 3] = 

=j> F(0 I A) = 0.5 • 0.2 - 0. 1 • 0. 1 = 0.001 . 

3. Example 3: Urn-ball Example 

Note that the states are directly observable in the previous examples. However, in 
most real world cases, the state sequence that produces a given sequence of patterns 
cannot be determinated, and the model is said to be “hidden”. The most common 
example for this type of description is the um-ball model, according to Fig 2.3: 
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Figure 2.3. The classic Um«ball example. There are three (N=3) urns that contain a large 
number of color balls. The colors of the balls are red, green, and blue. The boy we assigned to 
paint the balls runs out of blue paint sometime while painting and uses sky blue afterwards. Thus, 
eventhough there are 4 colors, we assign the same symbol for blue and sky blue and the number of 
symbols is three (M=3). Each time a person randomly selects an um, he selects a ball, and 
announces the color. The process is repeated 4 (T=4) times. Note that: 1) we don’t know from 
which um the ball came from, as we only have a color sequence, and we map those colors with the 
symbols according from Table, 2.2; 2) we assume that it is a left-to-right model for simplicity. 



Table 2.2. Um-ball example. T, Q, C, O values parameters 



Clock Time T 


1 


2 


3 


4 


Um (Hidden State) Q 


qi 


qi 


q2 


qs 


Color C 


Red 


Blue 


Sky Blue 


Green 


Observation Symbol Os 


1 


2 


2 


3 



Table 2.2 shows that we 
is N=3 and each um represents 



have T=4 observations of four balls. The number of urns 
one of the 3 hidden states, because we don’t know from 
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the specific um number each ball comes from. Further, we consider that the sky blue 
color belongs to the same class as blue, so we assign the same observation symbol (Os=2) 
for both colors, and the total number of symbols is M=3. The possible observation 
symbols are V={ 1, 2, 3}, and the states are Q = {qi, qi, qa}. We also assign values to the 
transition probability matrix A={aij}, and initial conditions vector n={7ti}: 



A — {fly}— 



0.5 0.3 0.2 

0 0.8 0.2 , 

0 0 1 



n = {;7;}={0.9, 0.1, 0}. 



Note that the specific upper triangular structure of A indicates the model considered here 
is left-to-right, as we cannot go backward. In addition, we assume that the probability of 
selecting the first um is 90%. Finally, we also need the observation symbol probability 
distribution at every state, to describe the model completely. This information is 
presented in the matrix B, which contains the probabilities of being at the state j and 
observing the symbol k, such as: 

B = W, l<j<N, 
with the following properties (like A): 

bjk >0 

M 

k=\ 

Note that B is a NxM matrix where N represents the number of hidden states and M the 
number of symbols. Matrix B isn’t restricted to be square. In our example, a possible B 
matrix may be defined as: 



\<k<M , 



vy. 



(2.7) 



( 2 . 8 ) 
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B — bjk — 



0.3 0.5 0.2 
0.4 0.4 0.2 
0.6 0.3 0.1_ 

The statistical model is completely described by the set of matrices A, B, and the 
vector 7t and usually denoted as: 

X = {a,B,4 (2.9) 

Note that all rows of matrices A, B and n shown sum-up to 1, thereby providing an easy 
check on the validity of the model. Extending our example to a general case, Table 2.3 
lists the definition of all HMM parameters, as given by Rabiner [3], for a generic HMM 
model: 

Table 2.3 HMMs elements 

• T: Length of the Observation Sequence (total number of clock times) 

• N; Number of States in the Model 

• M: Number of observation symbols 

• Q={qi, q 2 , . . . , qN>: States 

• V={vi, V 2 , . . . , vm}: Discrete Set of Possible Symbol Observations 

• A = {ay}, aij = Pr{qj @ t+l|qi @ t}: State Transition Probability 
Distribution 

• B = {bj(k)}, bj(k) = Pr(Ot=Vk|qt=Sj): Observation Symbol Probability 
Distribution in State j 

• % 71, = Pr(qi @ t = 1): Initial State Distribution 
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D. THE THREE BASIC PROBLEMS FOR HMM’S 

Three basic problems of interest must be solved in order to specify the model X = 

{A, B, 71} , and use it in classification applications [3]: 

Problem 1: How to compute the probability of the observation sequence Pr(0|X), 
for O = {Oi, O 2 , . . . , Ot}. 

Problem 2: How to compute the most optimal state sequence I = {ii, ia, . . . ,ix}, 
for a given observation sequence O = {Oi, O 2 , . ■ ■ , Or} and model X. 

Problems: How to adjust the model parameters X = {A, B, 7t} to maximize 

Pr(0|X). 

Problem 1 is an evaluation problem, because we want to evaluate the probability 
Pr(0|X), for the specific model. The hidden part of the model is the state sequence I that 
we attempt to find in Problem 2. Note that there are many possible state sequences, but 
only one is optimal. Finally in Problem 3, we adjust the parameters A, B, and tc of the 
model X, so that the probability Pr(0|X) is maximum, i.e., we attempt to optimize the 
model X. 

E. SOLUTIONS TO THE THREE HMM PROBLEMS 
1. Problem 1: Evaluation of Pr(0]X) 

Problem 1 deals with evaluating the probability of the observation sequence O, 
given the model X, i.e. Pr(0|X). Using basic probability principles, it is the summation of 
the conditional probability Pr(0,I|X) over all possible state sequences I [3]: 

Pr(0 1 A)= XPr(0 1 7,A)Pr(/ |/l), (2.10) 

all! 
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This evaluation first requires the definition of Pr(0,I|X), the joint probability that the 
observation 0={0i, O2, . . . ,Ot) and the state sequence I={ii, i 2 , . . . , ir} occur at the 
same time, given the model X [6]. Using the Bayes’ rule it is computed as: 

Pr(0, / I i) = Pr(0 I /, A)Pr(/ | A) . (2.1 1) 

Since we assume independence of the observation, we can write: 

T 

Pr(0 I /,i) = n I i‘,A) =bi, {Oi)bi, {O2). ..bi^ {Ot) , (2.12) 

(=1 

and 

= (2-13) 

Therefore, replacing eqs (2.10 ), and (2.13) into (2.1 1 ) leads to: 

Pr(0 |i)= (2-14) 

all i 

As an example of this computation. Let’ s apply the above result to the um-ball example 
considered earlier in section C and in Figure 2.3. The components of the model 

X{A, B, n) are: 





'0.5 


0.3 


0.2" 




'0.3 


0.5 


0.2‘ 


A = 


0 


0.8 


0.2 




0.4 


0.4 


0.2 




0 


0 


1 




0.6 


0.3 


0.1 



;r = {0.9, 0.1,0}. 



The observation sequence is: 



o=a 2, 2 , 3 ;, 
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with the number of states N=3, clock times T=4, and the symbol dimension M=3. Recall 
that we don’t know the state sequence, i.e., we cannot identify which specific urn, the 
color balls came from. Using eq (2.14) we can find the hidden state sequence I, by 
picking that which leads to the highest probability Pr(0,I|X). For example, possible state 
sequence that we can use are I={ 1,1,1,!} (which means that all balls come from the first 
urn), or I={ 1,1,1,2}. Note that this model is a left-to-right model, as the state sequence 
doesn’t go backwards, thus, for instance the state sequence I={ 1, 3,2,2} is not valid as we 
can’t go from the 3rd to the 2nd state. Therefore: 

for / = / 1 , 1, 1 , 2 }: (first 3 balls from F' um, 
last one from 2"^ um) 

7Ti ifoi (0 1 )rZni 2 (O 2 ili'ibi'i (0^) • • • atr - uTbaifilT ) 

— 7tilbh{\^iii2bi2(2^ai2i3biy{2^* * ‘ OiT - xirbiT^^ 

= 0.9 • 0.3 • 0.5 • 0.5 • 0.5 • 0.5 • 0.3 • 0.2 

= 0.0010125 



for I = {3,3, 3,3 }: (all balls from 3'‘‘ um) 
Ki\bi\ifD'^ai\i2bi2{0‘2fli2i')bi2{f)'^‘ • • atr - urbirifilT ) 

= 7tiibii(li^aiii2bi2{2^ai2i3bi2{2^* * * atr - uTbtr({3'^ 

= J^'i^3i<333&32<333^32<333^33 

= 0.90.610.31-0.31-0.1 
= 0.00486 



As we can see, the number of combinations of all possible sequences is very 
large, even though the number of the observations T is relatively small. In this problem 
(T=4), the direct computation of Pr(0|X) requires (2T-1)N^ multiplications and N^-1 
additions, which is too expensive for real applications. For instance, the number of 
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computations is (2*100-l)3'°°+3"“-l when T=100 and N=3. Therefore, the more 
efficient forward-backward procedure was introduced to solve this problem [3, 6]. 

a) Forward Procedure: 

We define the forward variable, at(i), as the probability of the partial 
observation sequence {Oi, O 2 , . . . ,Ot}, until time t and state qi at time t, given the model 
X., such as: 

a,(i) = Px{0„0^.....0,,i, =q\X). (2.15) 

The forward algorithm defined below is used to evaluate all possible Ot(i) variables: 

Initialization: 

a,(i) = Tz,b,(0,h \<i<N (2.16) 

Recursion: 

a,Jj) = [f,oi,(i)ay ]b/0„, ) fort = 1, 2, ...,7-1, l<j<N (2.17) 

1=1 

Termination: 

PT{0\A)='^^a^(i). (2.18) 

Eq (2.18) is also know as the Baum- Welch probability. As we can see 
from the recursion, the forward procedure gets its name from the fact that Ut+i is evaluated 
using the previous value of the forward variable CXt. Computing Pr(0|X) using this 
procedure requires only calculations instead of the 2TN^-1 required by the direct 
definition [3]. For example direct computation requires 2TN =7.4- 10 computations, 
while the forward procedure requires N^=1920 for N=8 and T=30. 
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Figure 2.4a Trellis Diagram for T=4, and N=3, ergotic model 
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: Current position (State, Time). Total number of positions=N»T 



: Possible movement (for the left-right model they are less) 




:Most likely path 



Figure2.4b Trellis Diagram for T=4, and N=3, left-to-right 
model as used for example in Figure 2.3. Fewer possible paths 
than in ergotic model are allowed. 



The computational savings obtained in the forward procedure are 
illustrated in Figure 2.4a and 2.4b which present all possible combinations of state 




sequences from clock time ti to U in a trellis diagram. Figure 2.4a presents the trellis 
diagram for an ergotic model, and T=4, N=3. Each calculation of Ot(i) with the forward 
procedure only requires the computation of the values at.i(j), for l<j^. Note that the 
procedure discards the routes less likely to occur, so that the probabilities through the 
previously discarded routes do not get reevaluated at the next iteration time. 

Finally, the probability Pr(0|X) is obtained by the summation in eq (2.18). 

b) Backward Procedure: 

Similarly, we define the backward variable Pt(i) as the probability of the 
partial observation sequence from t+1 to the end, given q; at time t and the model X [3], 
where: 

= . . . Or\ir =Qn ^)- (2-19) 

Pt(i) is computed recursively from t=T down to t=l as follows: 

Initialization: 

Pr(i) = \, i<i<N . (2.20) 

Recursion: 

= t = T-\,T-2,...,l , l</<iV . (2.21) 

>=i 

Note that the backward variable is issued to re-estimate the model X (Problem 3), and that 
the number of computations is decreased to N^. 

Therefore, using the definitions of the forward and backward variables, 
and according to [6]: 

Vx{0\X)='^a,(i)li,(i), t = \,-,T. (2.22) 
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Um-Ball Example: 



Forward and backward variables are evaluated for the um-ball example 
considered earlier. Recall for this example: 





*0.5 


0.3 


0.2' 




‘0.3 


0.5 


0.2‘ 


A = 


0 


0.8 


0.2 


, B = 


0.4 


0.4 


0.2 




0 


0 


1 




0.6 


0.3 


0.1 



n = [Q.9, 0.1, 0}, 



0 = {\, 2, 2,3;. 



Thus, using eq (2.16) leads to: 



a,( 1 ) = K,b,(0, ) = 0.9 b,(\) = 0.9 • 0.3 = 0.27 



2 ; = ) = 0. 1 • 1 ) = 0. 1 • 0.4 = 0.04 



a,(3) = z^b^(O,} = 0-b^(\) = 0. 

Next, using the recursion formula (2.17) leads to the following values for 
the forward variable Ot(i) for t=l, 2, , T-1 and 1^^. For example, 

a,{l) = iYa,[i)a,^)-b,{0,) 

/ = 1 

= (a,(l)-an +a^{2)■a2^ +aj(3)-a3,)-0.5 
= (0.27 0.5 + 0.04 O + O) -0.5 
= 0.0675 
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1=1 " 

= (S"i('K2)-^2(2) 

/=1 

= (ofj (l)’ 0,^2 ■*■ ^1 (2)‘ 0.22 '*’ ^1 (^) ’ ^32 ) 

= (0.27 • 0.3 + 0.04 • 0.8 + O) • 0.4 
= 0.0452 

«2(3) = (i«.(0^,^)-^3(C>2) 

/=1 

= (X«, (2) 

i = l 

= (orj(l)-<2[3 + 0^1 (2) -<323 +Q!^i(3)-a!33)-0.3 

= (0.27 0.2 + 0.04 -0.2 + 0) -0.3 
= 0.0186. 

All values of the forward variable are shown below in the matrix Afw: 

0 

0.0186 
0.012342 ’ 

0.00202298_ 

which is of dimension: 4x3 (TxN). Similarly, we compute the backward variable, 
according to eqs (2.20) & (2.21) starting from t=T. Recall from eq (2.20) that: 

= Vi = 1,2,3. 

Next, using the recursion formula given in eq (2.21) to compute Ps(i) leads to: 






0.27 0.04 

0.06 0.0452 

0.016875 0.022564 

0.0016875 0.00462274 
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;=1 

= f^a,^bp)!i,(\) 

y=i 

= [a,i •^i(3) + ai2 - 62 ( 3 ) + a,3 -63(3)]-^/lj 

= (0.5 • 0.2 + 0.3 • 0.2 + 0.2 • 0.1) • 1 
= 0.18, 



All values for the backward variable |3t(i) are shown below in the matrix Bbw as: 




0.027582 


0.022152 


0.009 


0.0726 


0.0636 


0.03 


0.18 


0.18 


0.1 


1 


1 


1 



Note, that the values in the last row are always 1, according to eq (2.20). At this point, 
we can evaluate Pr(0|X), using the forward variable, according to eq (2.18), which leads 
to: 

Pr(0 1 X) = X!, a^d) = ZH: ocji) = or, (1) + or, (2) + or, (3) + or, (4) 

= 0.0016875 + 0.00462274 + 0.00202298 
= 0.00833322000000. 

2. Problem 2: Optimal State Estimation 

Problem 2 deals with finding the optimal state sequence I for a given observation 
sequence O. One possible solution to this problem is to maximize the expected number 
of correct individual states by choosing the states it that are more likely to occur [3]. This 
computation uses the variable Yt(i), defined as the probability of being in state qt at time t, 
given the observation O and the model X [3]: 

r,{i)=fi{i,=‘h\0,^). (2.23) 
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Yt(i) can be expressed as: 




Pt{0\A) ' 



(2.24) 



where at(t), pt(t) are the forward and backward variables, defined earlier. Replacing eq 
(2.23) into eq (2.14) leads to: 






(2.25) 



Note that the normalization factor Pr(0|X) in the denominator of eq (2.24) is needed to 
make Yt(i) a conditional probability, which leads to the following constrain being 
satisfied: 



Er.O')'!- ( 2 - 26 ) 

i=l 

Finally, the optimal state sequence ii can be obtained by: 

i, = arg max[/, (/)], l<t<T. (2.27) 

l<i<N 

A more efficient approach to compute the optimal state sequence uses a decoding based 
on dynamic programming called Viterbi algorithm and shown in Table 2.4. The Viterbi 
algorithm is designed to find the best path (sequence) which maximizes the probability 
Pr(0, I|X) [3]. 
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Table 2.4 Viterbi Algorithm 



Initialization: 

<Pii}) = 0 

Recursion: 

= for2<t<T. l<j<N 

«’,0) = 2^gn>ax[<5„,('Kl 

l<i<N 

Termination: 

P* = max[(^j. (O] 

l<i<Af 

=argmax[(^j-(/)] 

l<i£Af 

Path (State Sequence) backtracking: 

(C,) Forr = r-l,T-2,-,l 



Application: 

As an application, let’s go back to example 3 defined earlier in section 3 to find 
the optimal sequence it, given the model X and the observation O. Recall that, for the 
given model: 





'0.5 


0.3 


0.2‘ 




'0.3 


0.5 


0.2' 


A — {<2/;} — 


0 


0.8 


0.2 


, B = {bjk}= 


0.4 


0.4 


0.2 




0 


0 


1 




0.6 


0.3 


0.1 



= {0.9, 0.1,0}, 0 = {l, 2,2,3} 



Thus, according to Table 2.4, we get: 
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d,(i) = 7t,b,(0,), l<i<N 
<5, (l) = 7iib,(0i j = 0.9 • 0.3 = 0.27 
<5, (2) = 7t2b^(0^ j = 0.1 • 0.4 = 0.04 
S^{3) = 7^^bJ(O^) = 0 
<Pi(i) = 0 

<5,(;) = max[5,.,(/)a,^j&^(6>,). l<j<N,so: 
‘52(yO=max[<5i(z>,^]&^.(C)2) 

^2 (l) = max[< 5 , {i)a., >, {O^ ) =max[< 5 , (l)a„,( 5 , (2>22„<5, (3>Z3, >, 



Similarly, after computing all 5 values, we define the matrix A, such as: 



A = {^,(y)}= 



0.27 0.04 

0.0675 0.0324 

0.016875 0.010368 
0.0016875 0.00165888 



0 

0.0162 

0.00486 

0.000486 



Then, we compute the cp variable, defined in Table 2.3 as: 
«’,(7) = argmax[j,.,(/)aJ. 

1</<V 



For example, <p2(l) is defined as: 

<Pi (l) = arg max[< 5 , {i)a., ] = arg max[< 5 , (2)02,.^, (3)^3, ] 

\<i<N 

= arg max [0.27 • 0.5,0.04 • 0,0] = 1. 



Similarly, we compute all other (pt(j) variable, and define the <E> matrix: 



^=k 0 )}= 



0 

1 

1 

1 



0 0 
1 1 

2 3 
2 3 



Next, we find the last state at time T =4 defined as: 

=argmax[<^2.(j)], 

which leads to: 
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■* = arg max [S^ {i )] = arg max[^j (l), (l), 5^ (3)] 



= argmax[(^3(l),J3(2),(^3(3)] = argmax[0.0016875 0.0016588 0.000486] 

= 1 . 

Finally, the other state sequence are defined using the backtracking method of the Viterbi 
algorithm given in Table 2.4, which leads to: 



Therefore, the optimal state sequence for this model and observation sequence is given by 
i,={ 1,1,1,!}, which shows that all balls come from the first urn. Note that repeating the 
same experiment with the observation sequence, 0={1,2,3,3}, results in ie={l,2,2,2}. 
Further, note that we always expect a forward moving state sequence, since our model is 
a left-right model. In addition, we can also use the variable 5t(j) to evaluate Pr(0|X), and 
thus we introduce the Viterbi probability Prv, such as: 



Using eq (2.28) leads to: 

Pr^ = max{^4 (/)}= maxj^^ (1) , 8^ (2) , 5^ (3)} 

l</<3 

= max{0.0016875 0.00165888 0.000486} 

= 0.0016875, 

which is consistent (meaning in the same range) to the value 0.00833 found earlier using 
the Baum-Welch probability eq (2.18). Note that we can’t expect to find the same exact 
value (since theoretically is not the same), but we can use both probabilities to reconfirm 
the decision. 




<p2[h)=<Pl{^) = ^- 



Pr^ =max{^r(0)- 



(2.28) 
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3. 



Problem 3: Re-estimation of Model Parameters 



The last problem deals with adjusting the model parameters (A, B, 7i), so that 
probability Pr(0|A.) is maximum. To that end, we define the variable ^(i,j) as the 
probability of a path being in state qi at time t and making a transition to state qi at time 
t+1, given the observation sequence and the model, such as [3]; 



Vr{0\X) 



(2.29) 



Recalling the definition of Yt(i) given earlier in eq (2.25) as the the probability of being in 
state qi at time t, given the observation O and the model X, and using eqs (2.23) and 
(2.24), we can relate Yt(i) to ^(i), by summing ^tO) over all states j, which leads to: 

= (2-30) 

>=1 

Similarly, summing Yt(i) and ^t(i) for all t’s, leads to the following result[3]: 
r-i 

^ 7 , (z ) = Expected number of transitions made from state q^, (2.3 1 ) 

r=l 



and: 



r-i 

(zj) = Expected number of transitions made from state to state qj . (2.32) 

/=! 

Finally, using eqs (2.31), (2.32) and the definitions of the model parameters, we can 
reestimate the model, according to the Baum-Welch formulas: 



1. ;r,. = yi{z), l<z<A^, 



(2.33) 
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(2.34) 



7-1 



Sf,(y) 



0 a = 

^ ij T-] 



lr.(0 



f=l 



Zy>(0 



3. bj(k) = 



— 



Sr,(0 



/=! 



(2.35) 



We then continue reestimating our model (applying the new model to the variables y and 
^), with the re-estimations formulas defined above, until we reach convergence, i.e., so 
that: 



~ ^old 



Application: Um-ball example. 

Next, we apply the re-estimation formulas to our um-ball model described earlier: 



Recall: 





”0.5 


0.3 


0.2' 




"0.3 


0.5 


0.2' 


A = \aij]= 


0 


0.8 


0.2 


, B = {^4= 


0.4 


0.4 


0.2 




0 


0 


1 




0.6 


0.3 


0.1 



, n = {;rj= {0.9, 0.1, 0}. 



For the observation: 0 = {\,2,2,3}, and a single iteration we get: 





'0.6256 


0.2945 


0.0079' 




'0.4362 


0.4649 


0.0988' 


A — 


0 


0.8984 


0.1015 


, fi = W= 


0.0712 


0.5573 


0.3714 




0 


0 


1 




0 


0.4697 


0.5302_ 



n = {;7:}= (0.8936, 0.1063 0}. 
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Note that, 1) the summation all rows of the matrixes A, B and vector n remains equal to 
1, and 2) A is still an upper triangular matrix, since our model is let-right. Re-estimating 
the model for 10 iterations leads to: 





'0 


1 


0 




‘1 


0 


0 


A = 


0 


0.3341 


0.6658 


, B = 


0 


1 


0 




0 


0 


1 




0 


0.3325 


0.6674 



n = u, 0, 0}. 

As we can see from the above results, the matrix A still remains an upper 
triangular, which means that the model is still a left-right model. Examining the B matrix, 
we conclude that, if we are in the first state, we will observe only the 1** symbol (1^* row, 
column =1), etc. Finally, the n matrix, shows that procedure starts from the first state. 

F. SCALING 

Numerical implementations of the forward-backward, Baum-Welch or Viterbi 
algorithm may lead to underflow problems due to the small numbers involved in the 
required computations. In addition, the problem may become worse, as the matrix 
dimensions involved in the computations increase. As a result, scaling is required to 
avoid such mathematical problems [4]. The basic idea behind the scaling procedure is to 
multiply the forward and backward variables Ot(i) and 3t0) by a coefficient so that the 
scaled d,{i) and are kept within the dynamic range of the computer. 

Note that we can rewrite the reestimation formula eq (2.34), in terms of the the 
forward and backward variables, as [4]: 

«« = • (2.36) 

E S"/ (;■) 

1=1 ;=i 
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Consider the forward algorithm used to compute of the forward variable a,(i), discussed 



earlier. ot,(i) is multiplied by the “scaling” factor Ct defined as: 

1 



c. = 



‘ N 






1=1 



leading to the scaled forward variable defined as: 

The scaled coefficient a, (/) can be shown to be equal to [4, pg. 272] 



(0 = ^ • 

r=l 7=1 

We also define a modified forward variable, as: 

;=i 

By induction can be written in terms of as: 

«,-i(7) = ff[c, Vi(;)- 
V J 



Thus: 






E4-i(y(nc,W(o,) . 

7=1 I r=l J _ 



N iV 



r-1 ^ 



E E ^'-1 (4 n ( O , ) E 

t=l 7=1 I r=l ^=1 



The scaled backward variable (/) is defined as: 

A (0 = c ,^,( 0 - 



(2.37) 



(2.38) 



(2.39) 



(2.40) 



(2.41) 



(2.42) 



(2.43) 
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At this point, we can rewrite the reestimation formula given in eq (2.36), using the scaled 
forward variable such as: 



: • (2.44) 



Similarly the re-estimation formula for bj(k) eq (2.35) becomes: 

bj{k) = ^ . (2.45) 



Finally, it can be proved that the probability Pr(0|X) and the Viterbi probability Pry can 
be computed using the scaling factor [4], which leads to: 

Pr(0|/l) = -J-, (2.46) 

n^- 

(=1 



or. 



log[Pr(0|A)] = -5;iogc, 

r=l 


(2.47) 


log(PrJ = max[^r(/)]. 


(2.48) 



G. MULTIPLE OBSERVATION SEQUENCES 

In real word applications we need to train the HMM using multiple trials, to 
ensure robustness in the recognition/classification process. Each training trial signal 
produces an observation sequence. Assume that there are K trials, i.e., K observation 

sequences. We denote the set of observation sequences O such as: 
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where: 






( 2 . 49 ) 



( 2 . 50 ) 

is the k* observation sequence. Provided that each observation sequence is independent 
of each other and identically distributed, we want to find the model which maximizes the 
probability: 



Pr(o I A) = npr(o«ii)=np.- 



( 2 . 51 ) 



i=l 



i=l 



Therefore, the multiple observation reestimation formulas using the scaled variables are: 



= 



/=! 

k ^\ /=1 



( 2 . 52 ) 




E«,‘(OA*0) 

rj, /=1 



( 2 . 53 ) 



Note that we don’t reestimate Tq, so we keep it unchanged through the iterations. 
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III. ISOLATED WORD RECOGNITION 



This section presents the application of HMMs to speech recognition. 
Specifically, we show how the model parameters have to be selected and adjusted using a 
limited 4-word recognition example. Our goal is to show how a more generic classifier 
can be set-up through the speech recognition example. In addition, we point out the 
potential difficulties one may encounter while setting up and implementing the software 
for such an application, due to computer numerical precision limitations. 

A. GENERAL HMM TRAINING AND TEST PROCEDURE 

This section describes the application of HMMs to classification in the context of 
speech recognition. The overall procedure is illustrated in Figure 3.1. First, labeled data is 
used to train the model. These specific training signals may be multiple trials of the same 
type of signal, i.e., belong to the same class, or belong to different classes. For example, 
multiple trials of the same word may belong to the same signal class in speech 
recognition applications. In such a case, all words used in the recognition set-up 
constitute the dictionary. Next, information uniquely characterizing each class needs to 
be extracted from the signal classes. Thus, each signal is split into T segments, and some 
useful features extracted from each. Feature vectors may include LCP or cepstral 
coefficients, energy, etc... Thus, the initial signals are converted into a set of continued- 
valued vectors. Next, this set gets converted into a sequence of discrete vectors using 
vector quantization (VQ) [4,5], which will be described further in Section 2. The set of 
M discrete symbols forms the codebook. For example, assume we have a speech signal 
divided into 4 segments (i.e., T=4), and that the set of symbols representing one signal 
segment is a number ranging from 1 to 8. Thus, a possible observation sequence of the 
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k* signal may be given as O'^ = {7, 3, 4, 8}. At this point, we can check whether the 
features extraction method and dimension of the codebook M makes sense by comparing 
the observation sequences obtained for the signals of the same class, as it is reasonable to 
expect some similarity between the resulting sequences. 

Recall that the set of observations derived from each class of training signals is 
the only information used to train a class-specific model X{A, B, it}. First, an initial 
estimate of the model is required to apply the Baum-Welch algorithm, as described 
earlier in Section II. Initial values may be set randomly so that they satisfy the model 
constraints and converge to the correct type of model, i.e., the initial matrix A is to be 
selected as upper triangular for left-to-right models so that it converges to an upper 
triangular matrix. The Baum-Welch algorithm iteratively estimates the model and stops 
when there is no significant model parameter changes between two successive iterations. 
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Figure 3.1 HMM General Training and Testing Procedure 
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Unlabeled signals are used during the testing phase to identify which class they belong to. 
First, testing data are pre-processed to generate the codebook vectors and corresponding 
observation sequences, following the same process as that used during the training phase. 
Next, the set of the probabilities observing the tested signal O, given the k* model, 
Pr(0|A,^''^) is computed using either the Baum-Welsh or the Viterbi algorithm for all k**^ 
models, as discussed earlier in Section II. Finally, the model type is selected by choosing 
that with the highest probability, Pr(0|X°’’‘) . 

Next, we describe a simple 4-word recognizer designed to recognize the words: 
Statistics, Microsoft, Instructor, and Professor. Three trial words are used for training and 
one word is use for testing for each class. 

1. Data Creation and Preparation 

All data were recorded on the same machine running Windows-98 Sound 
Recorder with a sampling rate of 8000Hz sampling and 8bit mono encoding. One male 
speaker was used. However, note that there are some variations between the trials so that 
no word is pronounced twice exactly the same way. An energy detector was applied to 
remove silence before and after each word, resulting in word of about 9000 points. Next, 
each signal was interpolated to 10000 points to obtain trials with the same length. Finally, 
each data was sent through a pre-emphasis filter with transfer function H(z)=l-az'* with 
0=0.98, to emphasize the relative energy of the high-frequency spectrum which contain 
useful information. The MATLAB software implementation is presented Appendix A.3. 

2. Feature Extraction 

We varied the length and number of segments (time frame), and the specific type of 
feature parameters until we obtained a combination which lead to correct classification. 
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The final set of parameters was derived by each word into T=7 segments using a 
rectangular window, with an overlap of 10%, where the time length of each trial was 
about Isec. The following eight parameters were extracted for later use in VQ from each 
segment: LPC coefficients from a 7* order filter derived with the covariance method [1], 
and the energy of the section. Finally, we normalize the value of the energy dividing all 
values by the max energy. 

3. Vector Quantization 

Vector quantization (VQ) is a scheme, which maps a sequence of continuous- 
valued vectors into a sequence with a given number of discrete vectors, called the 
codebook [7]. Therefore VQ can be viewed as some type of encoding scheme, where the 
encoder y assigns a channel symbol y(x) from an ensemble of M symbols to each input 
vector x={xo, Xi,..., Xk-i) [7]. Note that there is no need to define a decoder, as the 
discrete sets of parameters never get translated back into the original vector. 

Basically, VQ partitions the set of coefficients into M disjoint sets. Each set is 
represented by a single vector {vn,}l<m<M, which is a centroid of the vectors in the 
coefficient set assigned to the m'*’ region [4]. 

Note that there is a distortion penalty associated with VQ, as all feature vectors 
are represented using a set of M codebook vectors. The larger the dimension of the 
codebook, the smaller is the overall distortion between original feature vectors x and 
codebook vectors Xr. The distortion measure d(x,Xr) resulting from the codebook selection 
process can be represented using the Euclidean norm as [7]: 

= . (3.1) 
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For our purpose two schemes were applied to derive the codebook vectors: 1) a 
competitive Neural Networks implementation, and 2) a K-means (or LGB algorithm) 
algorithm [7, 8]. 

a) Competitive Neural Network Implementation 
Input Competitive Layer 




R S 



Figure 3.2 Competitive Neural Network 

A competitive unsupervised neural network (NN) implementation was 
selected to compute the codebook, as shown in Figure 3.2 [2]. Basically, this NN can be 
viewed as a clustering scheme, where weights associated to each neuron are used to 
assign a symbol M to each word segment. The software implementation is presented in 
Appendix A.l. 
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Vectorl , Qass=2 



Vector2, Oass=4 



Vectors, Qass=4 






Vector4, Class=3 



Vectors. Oass=3 



Vectors, aass=3 




Figure 3.3 Diagram of coefficient vectors and VQ vectors, Euclidean distance for the 
word “statistics”, using a competitive neural network 

Figure 3.3 shows seven original feature vectors, and the corresponding 
quantized vectors obtained with the competitive NN. Training took about 2000 epochs 
and 50mn with a Pentium-HI 450MHz. Each initial feature vector is mapped into one of 
the M=4 quantized vectors. For example, vectors 2 and 3 get mapped to the same fourth 
class due to their consistency. Similarly, vectors 5, 6, and 7 get mapped to the 3^*^ class of 
the quantized vectors. 

b) K-means Scheme 

The second method considered is the K-means algorithm, also called the 
Lloyd or Linde-Buzo-Gray (LBG) algorithm [9]. The software implementation is given 
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in Appendices A.13 and A. 14 [8]. Basically, the K-means algorithm is a clustering 
scheme, which iteratively finds a set of k, quantized vectors into which all training 
vectors get mapped to with minimum distortion. The number of clusters increases 
iteratively by splitting the existing quantized vectors obtained at each iteration, until a 
desired number of quantized vectors or distortion levels is obtained [9]. Thus, the size of 
the codebook is a power of two, i.e., 2, 4, 8, 16, etc... The LGB algorithm is illustrated 
for the word “Microsoft” in Figures 3.3 and 3.4 for codebook sizes equal to 4 and 32 
respectively. 
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Vector4, Class=4 



VectofS, Class=1 Vectors, Class=2 
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N4=32 

Word: ’Microsoft' 



* Oroginal Coefficients 
— Quantized Vectors 



6 8 



Figure 3.4 Diagram of coefficient vectors and VQ vectors, Euclidean Distance for word 

“Microsoft,” using the K-means (LGB) algorithm and M=32. 

Note the distortions are much smaller when M=32 than when M=4. This 
is to be expected as a larger codebook size allows more flexibility. However, a large 
codebook may not be always desirable as it may lead problems in evaluating and 
comparing the resulting probabilities Pr(OI^), as we will see later. The K-means 
implementation is a little faster than the NN technique, however it is restricted to 
codebook sizes which are powers of two. No such restriction is needed for the NN 
implementation. 

Note that an initial selection of the number of segments T and the size of 
the codebook M may be made by observing a small selection of the resulting quantized 
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vectors. The basic idea is to select a combination of parameters, which lead to some type 
of consistency in the resulting quantized vectors in a given class. 

4. HMM Training 

This section describes the application of the HMM training theory presented in 
Section II to the speech recognition example. The HMM implementation uses MATLAB 
5.3, and is presented in Appendix A. MATLAB may not be the most desirable software 
language as it is relatively slow, however it was selected for ease of implementation to 
test the concepts. 

a ) HMM Initial Conditions 

Usually, a left-to-right model is preferred over the more general ergodic 
one in speech applications because model states can be associated with time in a 
straightforward manner [4]. Thus, we selected a left-to-right model, where the matrix A 
is upper triangular. In addition, we also prefer the model to start from an early state so 
that it can pass from all possible states. As a result, we didn’t use random values for the 
vector n, but instead we initialized the first coordinate to be quite large, the second one 
smaller, and so on, and applied the constrain that the summation of all values in n is 
equal to one. We use random values satisfying the appropriate constraint for the matrix 
B. 



b) Number of States N 

HMM states are called “hidden” because all information about the state 
sequence is not accessible directly, since the only information available is given by the 
observations. According to Rabiner [4], there are two schools of thought for the physical 
meaning of the number of states; the first one states that it represents the number of 
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sounds (i.e., phonemes) for each word, which is usually a number between 2 and 10. The 
second interpretation is that it represents the average number of observations in a spoken 
version of the word. Basically, we can assume that the number of states represents the 
number of distinct sounds (e.g., phonemes or syllables) of the word. In our case a 
number of states N equal to 4 was deemed appropriate, since our vocabulary contains 
only 4 words. 

Further investigations may be required when applying HMMs to other 
types of signals if we cannot relate the number of states to some physical meaning behind 
the data. At worse, the number of states can be selected by trial and error at that leading 
to the highest recognition results. Note that selecting too small a number of states may 
prevent from differentiating between classes. Simulations showed that selecting too large 
a number of states may result in overly long training time and numerical instability in the 
implementation which cannot be controlled with scaling. 

c) HMM Re-estimation 

The HMM re-estimation iterative scheme implemented uses the multiple 
observation sequence technique described earlier in Section 2. The MATLAB software 
implementation is given in Appendix A.7 to A.9. First, we re-estimate the model 
parameters for every different trial, and then we apply these results in the re-estimation 
formula in eqs (2.52) and (2.53). 

The HMM re-estimation step uses the scaled forward and backward 
variables to avoid numerical instabilities, and its implementation is given in Appendix 
A.6. In addition, we ran all MATLAB files using a scaled fixed point format with 15 
digits (format long). Since, some computations still resulted in “divided by zero” errors 
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after applying these corrective measures. We forced the denominator quantities to be 
equal to the smallest floating point number whenever there were found to be smaller. 

We used multiple iterations for each single observation HMM re- 
estimation step until convergence was reached. We noticed that the model doesn’t 
necessarily converge to the same parameters for the same observation, when repeating 
the re-estimation procedure with random initial conditions. However, correct decision is 
still achieved. 



d) Scoring 

The system is ready to test any word and categorize it as one the four class 
types after the four models k={ 1, 2,3,4} derived separately for each word (Microsoft, 
Statistics, Instructor and Professor) are identified. We repeat the same feature extraction 
scheme for the testing words, using either the neural network or the codebook derived in 
the K-means algorithm during the training process. Let’s assume that the observation of 
the testing word is O. During testing, we evaluated the probability Prbw(OlA,^^) for 
k=l,...4, using either the Baum-Welch or the Viterbi algorithm, and selected the model 
which gives the highest probability Pr(OlX,^^) (Appendix A. 10, A.l 1). 
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model: 



Classification of testing words using all models with parameters: 

M=4, N=8, T=7 




-1200 -1000 -800 -600 -400 -200 0 



-250 -200 -150 -100 -50 0 



Tested words: 1: Microsoft, 2: Statistics, 3: Instructor, 4: Professor 
Class models: Microsoft, Statistics, X^^^: Instructor, X^"'^:Professor 

Pbw[c 1B]: 101ogioPrbw(0|X) 

Pvitcrbi[<^B]: 1010gioPrv(0|?l) 

: Correct Decision (100% success) 

Figure 3.5 Scoring of all 4 testing words (1: Microsoft, 2: Statistics, 3: Instructor, 4: 
Professor) given each model Parameters: M=4, N=2, T=7. This system 
performed 100% successful decisions. 



Figure 3.5 presents the results obtained when the number of segments T is 
equal to 4, the number of symbols M is equal to 4 ( computed with the LGB algorithm), 
and the number of states N is equal to 8. The four horizontal bars contained in each word 
represent the probabilities lOlogioCPCOj^L^*^^)), k=l,...4. Thus, the highest probability is 
that closer to the OdB point, which represents the point of probability 1. The left column 
plots represent the results obtained with the Baum-Welsh algorithm while the right colum 
plots represent those obtained with the Viterbi algorithm. Recall that the four models 
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were trained with three trials for every class. The software 
implementation for this testing step is given in Appendix A.ll. Results show that correct 
classification is obtained in all 32 cases, as shown in Figure 3.5. 



model: 



Classification of all testing words using all models with 
parameters; M=4, N=2 (inappropriate), T=7 




-1200 -1000 -800 -600 -400 -200 



BW*-' 



-200 



-150 



-100 



-50 



Tested words: 1: Microsoft, 2: Statistics, 3: Instructor, 4: Professor 
Class models: Microsoft, Statistics, X^^^: Instructor, A^"^^:Professor 

PBw[dB]: 10IogioPrbw(0|X) 

Pvitcrbi[dB]: 101ogioPrv(0|?i) 

<J=n : Correct Decision 




Wrong Decision 



Figure 3.6 Scoring obtained for all four testing words (1: Microsoft, 2: Statistics, 3: 
Instructor, 4: Professor) given each model Model Parameters selected: M=4, 
N=2, T=7. This HMM has too few states (N=2) and cannot classify the word 
‘instructor’ correctly. 



46 



Figure 3.6 presents the results obtained when selecting too small a number 
of states (N=2). All other parameters are kept the same as those in Figure 3.6. Note that 
the algorithm reaches the wrong decision for the word “instructor” which was found to be 
“professor.” 

B. CONCLUSIONS 

This section presented an HMM-based classifier applied to a simple 4-word 
recognition problem. We implemented and tested the classifier using MATLAB, and 
described how we selected the various parameters which need to be selected to set-up the 
classifier. Next Chapter considers the application of the HMM-based classifier to 
seismo-acoustic signals to differentiate between two types of mine-like objects buried in 
shallow sand. 
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IV. HMM-BASED CLASSIFICATION OF SEISMO-ACOUSTIC MINE 

SIGNALS 

HMMs can be applied to various types of classification problems. This Section 
presents the results obtained for the classification of mine-like objects buried in sand. The 
mine data was obtained from Prof. Muir who heads a buried mine detection project 
started in November 1996 at the Naval Postgraduate School. Initial results obtained are 
described in Gagham [10], Fitzpatrick [11], and Hall [12]. 

The NPS mine project is a continuation of work started at the Applied Research 
Laboratory of the University of Texas at Austin, and is sponsored by the Office of Naval 
Research. The main goal of the project is to study the development of a seismo-acoustic 
sonar for the detection of buried ordnance using guided, seismic interface waves. Earlier 
ARL and NPS results showed that seismic interface (Rayleigh) waves can be used to 
detect mine-like objects buried in sand [10-12]. The goal of the current NPS program is 
to develop an improved seismic source to evaluate the feasibility of using a seismo- 
acoustic sonar to detect buried ordnance in the beach and surf zones. The seismic waves 
were generated by two actuators, and were measured by two three-axis sensors- 
geophones. Buried, mine-like objects, ranging from 71kg to 290kg, and at ranges of up 
to 5 meters were echo-located by applying a basic polarization filtering signal processing 
scheme. 

This section applies the HMM concepts derived earlier to distinguish between two 
types of mine-like objects. Two types of experiments were conducted: 1) classification of 
two different mine-like objects at the same range with multiple weights for each mine 
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type; and 2) classification of two different mine-like objects at 3 different ranges with the 
same weight for each mine type. 

A. BASIC EXPERIMENT INFORMATION 

This section does not present any theoretical, physical or experimental details for 
this project, as they may be found in in [10-12]. However, we do provide some basic 
information regarding the physical nature of the signals under study. 

The beach site used for collecting the data is a stretch of U.S. Navy-owned beach 
directly seaward of NPS, Monterey, CA, and shown in Appendix B.2. This area 
measured roughly 150 feet in length running parallel to the waterline and varied from 20 
to 50 feet from the high-to-low water mark. Generally, the sand conditions varied due to 
the different waterline distance, resulting in changes on some of the sand characteristics, 
such as density and moisture in multiple depth layers. 

The equipment configuration for the mine detection is illustrated in Appendix 

B. 6. Basically, two actuators produce the Rayleigh waves described in [11]. Rayleigh 
waves have distinctive features that make them identifiable in a complex seismo-acoustic 
wavefield. The most important property of Rayleigh waves is the elliptical particle 
motion produced by their passage. Their unique characteristic is the 90° phase shift 
between their horizontal and vertical components, which results in elliptical motion, as 
illustrated in Appendix B.7. Thus, the placement and operation of the actuators, as 
illustrated in Figure 4.1, can be categorized into two different configurations: 1) actuators 
operated horizontally; and 2) actuators operated vertically, with a 90° relative phase 
difference between their drive signals [11]. 
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□ 




Figure 4.1 Actuator placement for Rayleigh waves generation [11]. 



The generated seismo-acoustic wave travels through the geophones and meets the 
target, as illustrated in Appendices B.6 and B.8. Next, the reflected wave passes through 
the geophones again, and the received signal is used to detect the target presence. The 
signal is collected, as described in [12], and the signals for the relative x, y, and z-motion 
are shown in Appendix B.5. 

Note that the target reflection is not visible from the raw signals. Thus, vector 
polarization (VP) [12] was used to extract the Rayleigh waves from the raw information. 
The VP step takes advantage of the 90° phase shift between vertical and radial 
components to pull the target information out. This is accomplished by applying the 
Hilbert Transform to the radial -vertical signal pair [11], the complex crossed-power is 
computed such as [12]: 



^rv ^hilben ^ ^ hilben 



(4.1) 



the imaginary component of Prv is essentially proportional to the intensity of the seismic 
wave due to the 90° phase shift between vertical and radial components [12]. The 
polarity of the imaginary power is associated with the rotation of the elliptical motion of 
the wave (e.g., a negative value corresponds to an anticlock-wise motion). Real and 
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imaginary components of the cross-power Pp, are illustrated in Appendix B.9. Appendix 
B.8 shows the incident wave passing the geophone at 10ft. The reflected target signal 
strength is quite small, and we need to expand the scale to see it, as explained further 
later. 

Two mine-like objects were used: 1) a cylinder weighing 1501b (68kg), 5ft long, 
8in (20cm) in diameter, with a '/i-in (0.6cm) wall thickness; and 2) a U.S. Navy power 
can or “power keg” with the shape of a cylindrical sheet metal can 18in (46cm) high and 
24in (61cm) in diameter, weighing 161b (7kg). These objects are shown in Appendix 
B.3. Additional weights were used to vary the objects total weight. These additional 
weights made the maximum weight of the cylinder and the keg equal to 6181b (280kg), 
and 6401b (290kg) respectively. The cylinder was always buried on its side with the 
cylindrical axis horizontal and in the direction of the wave propagation. The powder keg 
was always buried upright, with the cylindrical axis vertical to the direction of the wave 
propagation. Further details are given in Appendix B.4. 

B. SIGNAL SELECTION 

Recall that the target signal is visible after processing the raw information using 
VP. Thus, we use the signals obtained from the imaginary portion of the cross power for 
our application only. We also focus on the experiments conducted on the 6**“ and 10* of 
November 1998, were the cylinder and the keg targets were used, as shown in Appendix 
B.6. Two parameters were varied during these two days in addition to the sea conditions: 
target weights and distances from the geophones. Ten trials were conducted for each 
experiment. The distances between the different pieces of the equipment (actuators, 
geophones, target) were set as shown in Appendix B.6. As a result, the actuator- 
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geophone distance was set at 10ft, the actuator-target distance: 16ft, and the actuator- 
target-geophone (reflected) was 22ft for both targets (cylinder and keg). Appendix B.8 
shows the average imaginary cross power obtained for all trials associated with a specific 
target and weight, scaled near the distance of 22ft. Appendix B.9 shows that the reflected 
target signal becomes stronger as the target mass increases. Appendices B.IO and B.ll 
plot the imaginary cross-power signals obtained from the cylinder and the keg targets 
respectively. We used the signal from the 2“^* geophone for the cylinder case, as it is the 
strongest of the two, and the signal from the 1^‘ geophone for the powder keg, as the 2“^* 
geophone doesn’t show any received signal. However, the signal measured at the 2“^* 
geophone for the keg experiment is used as an example of background noise (no target) 
signal. This background noise is used as part of the testing data, as described later. 

C. SIGNAL PREPARATION 

Plots contained in Appendix B.IO and B.ll show there is some consistency 
between the signals obtained from multiple weights of the same target. Thus, we elected 
to consider all signals associated with a given target to the same class, independent of the 
weight. Five trials were used for training, and the 6* trial was used for testing. The 
following weights were used for the keg target: 1) 2241b; 2) 4321b; and 3) 5361b. 
Similarly, the following weights were used for the cylinder target: 1) 3641b; 2) 4681b 3) 
5721b. Next, we assumed that we detected the presence of a non-background signal 
using some type of threshold detector, and truncated the signals used for the HMM set-up 
around the position of the target location. Signals used for the HMMs training and 
classification are shown in Figure 4.2. Note that it would be useless to include the 
incident signal received at a geophone from the actuators, as it doesn’t contain any 
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information about the target, and is much stronger than the signal reflected from the 
target. Each signal is 400 data points long. Two classes (keg and cylinder) are 
generated. At this point, the goal is to recognize the shape of the target, and each class 
needs to be characterized by a set of features, as explained next. 



Powder Keg 224!b 






Cylinder 3641b 




Cylinder 4681b 




Cylinder 5721b 




Figure 4.2 Imaginary component cross-power for the keg and cylinder targets for 3 
different weights, truncated near the target location. Each plot illustrates 6 different trials 
(for the same target, and weight) 

D. FEATURES EXTRACTI0NAT:CT0R QUANTIZATION 

Note in Figure 4.2 that the signals contain little spectral information. The 
frequency of the Rayleigh waves obtained during the November 6* and 10*^ experiments 
was 80ELz. As a result, the signals contain very little useful spectral information. We 
tried to apply LPC feature extraction, and other similar spectral coefficients, but we 
found little or no consistency between the trials of a given class. Therefore, we simply 
used the time-domain information directly. We segmented the signals into two segments 
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(T=2), and decimated each frame by a factor of 20:1, resulting in 10 data points over each 
segment, as shown in Appendix C.4. Finally, we normalized those points to compensate 
for the differences in signal strength coming from different weights, by dividing every 
value with the maximum one. 

We applied the LGB algorithm at the VQ stage and selected M=8 symbols. The 
code implementation is given in Appendix C.3. 

E. HMM TRAINING FOR THE MULTIPLE WEIGHTS EXPERIMENT 
HMMs need data to be trained. However, in practice the availability of data may 

be seriously limited for various reasons and cross-validation needs to be used [13]. 

This study has only a limited number of trials available: six trials for every type of 
mine-like object. Thus, we used cross-validation by successively selecting five trials for 
training and the last one for testing. The procedure was repeated six times rotating the 
testing signal each time. The code implementation is given in Appendices C.5 and C.6. 
We found that the best performance is achieved with the number of HMM states N equal 
to four. 

We created two ergodic HMM models: one for the keg class and one for the 
cylinder. Thus, we used a total number of 5(trials)x3(types of weight)=15 signals for the 
HMM training for every class (i.e., model). We selected an ergodic model as it 
performed better than the left-to-right model for the data. 

F. MULTIPLE WEIGHTS SCORING AND RESULTS 

The MATLAB file sequence.m given in Appendix C.6 performs all HMM 
training and scoring. Note that the testing signal is rotated each time, and the results 
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plotted, as shown in Figure 4.3a (cylinder training case) and in Figure 4.3b (keg training 
case). 




Pbw[ 4B]: 101ogioPrbw(0|^) 



PviterbildB]: 101ogioPrv(0[X) 

< : Correct Decision 

< - X— : Wrong Decision 

Figure 4.3a HMM testing results for the cylinder model (multiple weights). 

Prbw(0|XcyUnder) and Prv(0|XcyUnder) for all testing signals and all rotations (every row). The 
model XcyUnder is created by training all cylinder signals (5x3=15). The decision is correct 
whenever the tested signals (4,5,6) have a higher Probability (i.e., closer to 0 on a dB 
scale) than 1, 2, 3, or 7 (keg signals and background). Each row represents one of the six 
iterations of the rotation between testing and training signals. 
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PBw[dB]: 101og,oPrbw(0|X) 

Pvi«rti[dB]: 101og,oPrv(0|A) 

< : Correct Decision 

< X : Wrong Decision 



Figure 4.3b HMM testing results for the keg model (multiple weights). 
Prbw(0|A,iceg) and Prv(0|Xkeg) for all testing signals and all rotations (every row). The 
model Xiceg is created by training all keg signals (5x3=15). The decision is correct 
whenever the tested signals (1, 2, 3) have a higher probability (i.e., closer to 0 on a 
dB scale) than 4, 5, 6, or 7 (cylinder signals and background). Each row represents 
one of the six iterations of the rotation between testing and training signals. 



We used as background (non target) signals, the six trials obtained from the 2 °^ 
geophone during the keg experiment which didn’t track any target signal. Figures 4.3a 
and 4.3b show that an decision error for one case only. The signal considered in that set- 
up is the 4* tested signal, i.e., the cylinder with weight of 3681bs. Figure 4.2 shows that it 
is the weakest signal of all the mine signals. The total number of testing data is 
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